Take-Home Exercise 3. Predicting HDB Resale Prices with Geographically Weighted Machine Learning Methods

Author

ZHANG Jinghan

Modified

November 15, 2024

01 Overview

1.1 Background

Housing is an essential component of household wealth worldwide. Buying a housing has always been a major investment for most people. The price of housing is affected by many factors. Some of them are global in nature such as the general economy of a country or inflation rate. Others can be more specific to the properties themselves. These factors can be further divided to structural and locational factors. Structural factors are variables related to the property themselves such as the size, fitting, and tenure of the property. Locational factors are variables related to the neighbourhood of the properties such as proximity to childcare centre, public transport service and shopping centre.

Conventional, housing resale prices predictive models were built by using Ordinary Least Square (OLS) method. However, this method failed to take into consideration that spatial autocorrelation and spatial heterogeneity exist in geographic data sets such as housing transactions. With the existence of spatial autocorrelation, the OLS estimation of predictive housing resale pricing models could lead to biased, inconsistent, or inefficient results (Anselin 1998). In view of this limitation, Geographical Weighted Models were introduced to better calibrate predictive models for housing resale prices.

1.2 Objective

This study aims to calibrate a predictive model to predict HDB resale prices between July-September 2024 by using HDB resale transaction records in 2023.

02 Getting Started

2.1 Data

  • Asspatial dataset:

    HDB Resale Flat Prices provided by Data.gov.sg will be used as the core data set. The study should focus on 3 room flat (2b1b)

  • Geospatial dataset:

    • MP14_SUBZONE_WEB_PL: a polygon feature data providing information of URA 2014 Master Plan Planning Subzone boundary data. It is in ESRI shapefile format. This data set was also downloaded from Data.gov.sg
  • Locational factors with geographic coordinates:

    • Downloaded from Data.gov.sg.

      • Hawker Centre data is a list of hawker centres in Singapore. It is in geojson format.

      • Parks data is a list of parks in Singapore. It is in geojson format.

      • Supermarket data is a list of supermarkets in Singapore. It is in geojson format.

      • CHAS clinics data is a list of CHAS clinics in Singapore. It is in geojson format.

    • Downloaded from Datamall.lta.gov.sg.

      • MRT data is a list of MRT/LRT stations in Singapore with the station names and codes. It is in shapefile format.

      • Bus stops data is a list of bus stops in Singapore. It is in shapefile format.

  • Locational factors without geographic coordinates:

    • Downloaded from Data.gov.sg.

    • Retrieved/Scraped from other sources

      • CBD coordinates obtained from Google.

      • Shopping malls data is a list of Shopping malls in Singapore obtained from Wikipedia.

      • Good primary schools is a list of primary schools that are ordered in ranking in terms of popularity and this can be found at Local Salary Forum.

2.2 Packages

Here’s an overview of packages:

  1. sf: Manages spatial vector data (points, lines, polygons) in R.
  2. spdep: Analyzes spatial dependence and autocorrelation in data.
  3. GWmodel: Builds geographically weighted models, like GWR and GWGLM.
  4. SpatialML: Supports spatial machine learning and predictive modeling.
  5. tmap: Creates static and interactive maps for spatial data visualization.
  6. rsample: Helps with data splitting for model training and validation.
  7. Metrics: Provides performance metrics like MSE and RMSE for model evaluation.
  8. tidyverse: A suite of tools for data manipulation, cleaning, and visualization.
  9. httr: Handles HTTP requests for web data and APIs.
  10. jsonlite: Parses JSON data into R objects.
  11. rvest: Simplifies web scraping in R.
  12. units: Manages and converts measurement units.
  13. matrixStats: Efficient functions for matrix and vector stats.
  14. ggpubr: Adds publication-ready tools to ggplot2.
  15. car: Tools for regression diagnostics and hypothesis tests.
pacman::p_load(sf, spdep, GWmodel, SpatialML, 
               tmap, rsample, Metrics, tidyverse,
               httr,jsonlite, rvest, units, matrixStats, ggpubr,car)

03 Data

3.1 Import Data

3.1.1 Geospatial Data

URA 2014 Master Plan Planning Subzone boundary data

mpsz = st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL")
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `C:\JinghanZzz\ISSS626-GeoAnalytics\Take-home_Ex\Take-Home_Ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21

Hawker Centre data is a list of hawker centres in Singapore. It is in geojson format.

hawker_center <- st_read("data/geospatial/HawkerCentresGEOJSON.geojson") %>%
  st_transform(crs = 3414)
Reading layer `HawkerCentresGEOJSON' from data source 
  `C:\JinghanZzz\ISSS626-GeoAnalytics\Take-home_Ex\Take-Home_Ex03\data\geospatial\HawkerCentresGEOJSON.geojson' 
  using driver `GeoJSON'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

Parks data is a list of parks in Singapore. It is in geojson format.

parks <- st_read("data/geospatial/Parks.geojson") %>%
  st_transform(crs = 3414)
Reading layer `Parks' from data source 
  `C:\JinghanZzz\ISSS626-GeoAnalytics\Take-home_Ex\Take-Home_Ex03\data\geospatial\Parks.geojson' 
  using driver `GeoJSON'
Simple feature collection with 430 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6929 ymin: 1.214491 xmax: 104.0538 ymax: 1.462094
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

Supermarket data is a list of supermarkets in Singapore. It is in geojson format.

supermarkets <- st_read("data/geospatial/SupermarketsGEOJSON.geojson") %>%
  st_transform(crs = 3414)
Reading layer `SupermarketsGEOJSON' from data source 
  `C:\JinghanZzz\ISSS626-GeoAnalytics\Take-home_Ex\Take-Home_Ex03\data\geospatial\SupermarketsGEOJSON.geojson' 
  using driver `GeoJSON'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

MRT data is a list of MRT/LRT stations in Singapore with the station names and codes. It is in shapefile format.

mrt <- st_read(dsn = "data/geospatial", layer = "RapidTransitSystemStation") %>%
  st_transform(crs = 3414) %>%  
  st_zm()                  
Reading layer `RapidTransitSystemStation' from data source 
  `C:\JinghanZzz\ISSS626-GeoAnalytics\Take-home_Ex\Take-Home_Ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, : GDAL
Message 1: Non closed ring detected. To avoid accepting it, set the
OGR_GEOMETRY_ACCEPT_UNCLOSED_RING configuration option to NO
Simple feature collection with 231 features and 7 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21

The warning says some place is not closed,we use make valid to fix it

# find invalid
invalid_geom <- mrt[!st_is_valid(mrt), ]

#Check how many are missing
print(nrow(invalid_geom))
[1] 3
print(invalid_geom)
Simple feature collection with 3 features and 7 fields (with 1 geometry empty)
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 26568.45 ymin: 27478.44 xmax: 28080.89 ymax: 37543.25
Projected CRS: SVY21 / Singapore TM
    TYP_CD STN_NAM       ATTACHEMEN SHAPE_AREA SHAPE_LEN TYP_CD_DES
109      0    <NA> CC29_HBF STN.zip   10146.62  670.0397        MRT
NA      NA    <NA>             <NA>         NA        NA       <NA>
162      0    <NA>             <NA>   10070.02 1040.8044        MRT
                   STN_NAM_DE                       geometry
109  HARBOURFRONT MRT STATION POLYGON ((26736.44 27495.44...
NA                       <NA>                  POLYGON EMPTY
162 UPPER THOMSON MRT STATION POLYGON ((27808.12 37518.2,...
# 使用 st_make_valid() 单独修复无效的几何图形
#mrt[!st_is_valid(mrt), ] <- st_make_valid(mrt[!st_is_valid(mrt), ])

I tried the make_invalid but it didn’t work. So we check the invalid ones and we add some buffer to make it close. The buffer was 10 because considering 10m to the scale of a whole country, it would be acceptable.

invalid_geom <- invalid_geom %>%
  filter(!is.na(STN_NAM_DE))
valid_geom <- invalid_geom %>%
  st_buffer(dist = 10)  # add 10 meter buffer to close it
# delete the 2 invalid rows from `mrt` 
mrt <- mrt %>%
  filter(!(STN_NAM_DE %in% c("HARBOURFRONT MRT STATION", "UPPER THOMSON MRT STATION")))

# append `valid_geom` to `mrt` 
mrt <- bind_rows(mrt, valid_geom)

I observed the data and see Ulu Pandan Depot and Mandai Depot which I never heard of even if I checked before I went to the Zoo. Then GPT it to find that they are actually warehouse and place to fix the train like when green line shut down. So filter it to keep station only

mrt <- mrt %>%
  filter(str_detect(STN_NAM_DE, "STATION$"))

mrt_sf <- mrt%>%                   
  select(geometry) 

Bus stops

bus_sf <- st_read(dsn = "data/geospatial", layer = "BusStop") %>%
  st_transform(crs = 3414) %>%  # 转换到 EPSG 3414
  st_zm() %>%                   # 去掉 Z 维度
  select(geometry)  
Reading layer `BusStop' from data source 
  `C:\JinghanZzz\ISSS626-GeoAnalytics\Take-home_Ex\Take-Home_Ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5166 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48285.52 ymax: 52983.82
Projected CRS: SVY21

3.1.2 Aspatial Data

Primary Schools

The primary schools infomation is extracted from General information of schools from gov.data, and I have processed it in excel to drop some column and picked out primary and mixed level school(in case missing some primary schools)

allschools <- read.csv("data/aspatial/Generalinformationofschools.csv")
head(allschools)
                   school_name                      address postal_code
1     ADMIRALTY PRIMARY SCHOOL        11   WOODLANDS CIRCLE      738907
2 AHMAD IBRAHIM PRIMARY SCHOOL        10   YISHUN STREET 11      768643
3               AI TONG SCHOOL       100  Bright Hill Drive      579646
4     ALEXANDRA PRIMARY SCHOOL 2A   Prince Charles Crescent      159016
5  ANCHOR GREEN PRIMARY SCHOOL        31   Anchorvale Drive      544969
6      ANDERSON PRIMARY SCHOOL        19   ANG MO KIO AVE 9      569785
  mainlevel_code
1        PRIMARY
2        PRIMARY
3        PRIMARY
4        PRIMARY
5        PRIMARY
6        PRIMARY

Top Primary Schools

This data comes from a list of primary schools that are ordered in ranking in terms of popularity and this can be found at Local Salary Forum. According to Wikipedia, there are about 180 primary schools in Singapore, so the top 15% percentile will be around rank 27, therefore I include top 25 into the study.

topschools_csv <- read.csv("data/aspatial/topschools.csv")
head(topschools_csv)
                     school_name
1 Holy Innocents’ Primary School
2                 Ai Tong School
3       Nan Chiau Primary School
4                 Chongfu School
5         Nanyang Primary School
6       CHIJ Primary (Toa Payoh)

But the geometry is missing, so we need to use API to get the geometry data. The code was written by

HDB Resale Data

resale <- read_csv("data/aspatial/ResaleflatpricesbasedonregistrationdatefromJan2017onwards.csv") %>%
  filter(month >= "2023-01" & month <= "2024-09")
Rows: 193496 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): month, town, flat_type, block, street_name, storey_range, flat_mode...
dbl (3): floor_area_sqm, lease_commence_date, resale_price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

3.2 Data Wrangling

3.2.1 One Map API

Since some feature has no coordinates and geometry, the API code developed by Prof. Kam Tin Seong from SMU will be used here.

get_coords <- function(add_list){
  
  # Create a data frame to store all retrieved coordinates
  postal_coords <- data.frame()
    
  for (i in add_list){
    #print(i)

    r <- GET('https://www.onemap.gov.sg/api/common/elastic/search?',
           query=list(searchVal=i,
                     returnGeom='Y',
                     getAddrDetails='Y'))
    data <- fromJSON(rawToChar(r$content))
    found <- data$found
    res <- data$results
    
    # Create a new data frame for each address
    new_row <- data.frame()
    
    # If single result, append 
    if (found == 1){
      postal <- res$POSTAL 
      lat <- res$LATITUDE
      lng <- res$LONGITUDE
      new_row <- data.frame(address= i, 
                            postal = postal, 
                            latitude = lat, 
                            longitude = lng)
    }
    
    # If multiple results, drop NIL and append top 1
    else if (found > 1){
      # Remove those with NIL as postal
      res_sub <- res[res$POSTAL != "NIL", ]
      
      # Set as NA first if no Postal
      if (nrow(res_sub) == 0) {
          new_row <- data.frame(address= i, 
                                postal = NA, 
                                latitude = NA, 
                                longitude = NA)
      }
      
      else{
        top1 <- head(res_sub, n = 1)
        postal <- top1$POSTAL 
        lat <- top1$LATITUDE
        lng <- top1$LONGITUDE
        new_row <- data.frame(address= i, 
                              postal = postal, 
                              latitude = lat, 
                              longitude = lng)
      }
    }

    else {
      new_row <- data.frame(address= i, 
                            postal = NA, 
                            latitude = NA, 
                            longitude = NA)
    }
    
    # Add the row
    postal_coords <- rbind(postal_coords, new_row)
  }
  return(postal_coords)
}

3.2.2 Get Latest position of Resold HDB

resale_tidy <- resale %>%
  mutate(address = paste(block,street_name)) %>%
  mutate(remaining_lease_yr = as.integer(
    str_sub(remaining_lease, 0, 2)))%>%
  mutate(remaining_lease_mth = as.integer(
    str_sub(remaining_lease, 9, 11)))

Select data from 2023 and 2024 Jul to Sep as required by the task, and filter out 3 room resale data as our focus

# Filter the data from resale_tidy for the year 2023 and for Jul-Sep 2024,
# only keeping records where flat_type is "3 ROOM"
resale_selected <- resale_tidy %>%
  filter(
    # Select all records from the year 2023 or from Jul-Sep 2024
    (str_starts(month, "2023") | month %in% c("2024-07", "2024-08", "2024-09")) &
    flat_type == "3 ROOM"  # Only keep records where flat_type is "3 ROOM"
  )

# Display the filtered results
head(resale_selected)
# A tibble: 6 × 14
  month town  flat_type block street_name storey_range floor_area_sqm flat_model
  <chr> <chr> <chr>     <chr> <chr>       <chr>                 <dbl> <chr>     
1 2023… ANG … 3 ROOM    225   ANG MO KIO… 04 TO 06                 67 New Gener…
2 2023… ANG … 3 ROOM    310C  ANG MO KIO… 25 TO 27                 70 Model A   
3 2023… ANG … 3 ROOM    225   ANG MO KIO… 07 TO 09                 67 New Gener…
4 2023… ANG … 3 ROOM    319   ANG MO KIO… 04 TO 06                 73 New Gener…
5 2023… ANG … 3 ROOM    319   ANG MO KIO… 07 TO 09                 73 New Gener…
6 2023… ANG … 3 ROOM    220   ANG MO KIO… 04 TO 06                 67 New Gener…
# ℹ 6 more variables: lease_commence_date <dbl>, remaining_lease <chr>,
#   resale_price <dbl>, address <chr>, remaining_lease_yr <int>,
#   remaining_lease_mth <int>

Use the API above to get the coordinate of resale HDB property

add_list <- sort(unique(resale_selected$address))

coords <- get_coords(add_list)

Append the coords back to resale

resale <- resale_selected %>%
  left_join(coords, by = c("address" = "address")) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs = 3414)
write_rds(resale, "data/rds/resale.rds")

Since we need to calculate the proximity later, we need to transform the data from WGS84 to singapore’s crs system

hdb <- read_rds("data/rds/resale.rds")

hdb_sf <- st_as_sf(hdb, coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs = 3414)

head(hdb_sf)
Simple feature collection with 6 features and 15 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 28537.68 ymin: 38517.37 xmax: 29564.76 ymax: 38825.23
Projected CRS: SVY21 / Singapore TM
# A tibble: 6 × 16
  month town  flat_type block street_name storey_range floor_area_sqm flat_model
  <chr> <chr> <chr>     <chr> <chr>       <chr>                 <dbl> <chr>     
1 2023… ANG … 3 ROOM    225   ANG MO KIO… 04 TO 06                 67 New Gener…
2 2023… ANG … 3 ROOM    310C  ANG MO KIO… 25 TO 27                 70 Model A   
3 2023… ANG … 3 ROOM    225   ANG MO KIO… 07 TO 09                 67 New Gener…
4 2023… ANG … 3 ROOM    319   ANG MO KIO… 04 TO 06                 73 New Gener…
5 2023… ANG … 3 ROOM    319   ANG MO KIO… 07 TO 09                 73 New Gener…
6 2023… ANG … 3 ROOM    220   ANG MO KIO… 04 TO 06                 67 New Gener…
# ℹ 8 more variables: lease_commence_date <dbl>, remaining_lease <chr>,
#   resale_price <dbl>, address <chr>, remaining_lease_yr <int>,
#   remaining_lease_mth <int>, postal <chr>, geometry <POINT [m]>

3.2.3 Top Primary Schools

These data comes from a list of primary schools that are ordered in ranking in terms of popularity and this can be found at Local Salary Forum. According to Wikipedia, there are about 180 primary schools in Singapore, so the top 15% percentile will be around rank 27, therefore I include top 25 into the study.

First we select top schools from the complete list. To do that we need to clean the align the case and format of school names

clean_school_name <- function(name) {
  name %>%
    toupper() %>%                                     # 转换为大写
    str_replace_all("’", "'") %>%                     # 替换右弯引号为直引号
    str_replace_all("‘", "'") %>%                     # 替换左弯引号为直引号
    str_replace_all("ST\\.", "SAINT") %>%             # 替换 'St.' 为 'Saint'
    str_replace_all("[^A-Z0-9 ]", "") %>%             # 去除标点符号
    str_replace_all("PRIMARY SCHOOL|PRIMARY SECTION|PRIMARY", "") %>% # 移除描述词
    str_squish()                                      # 去除多余空格
}

#  clean the column school_name in allschools and topschools
allschools$school_name <- sapply(allschools$school_name, clean_school_name)
topschools_csv$school_name <- sapply(topschools_csv$school_name, clean_school_name)

# filter by capital case
topschools <- allschools[allschools$school_name %in% topschools_csv$school_name, ]

print(topschools)
                         school_name                       address postal_code
3                     AI TONG SCHOOL        100  Bright Hill Drive      579646
9         ANGLOCHINESE SCHOOL JUNIOR            16   WINSTEDT ROAD      227988
10               ANGLOCHINESE SCHOOL              50   BARKER ROAD      309918
24              CATHOLIC HIGH SCHOOL         9    BISHAN STREET 22      579767
33                    CHIJ TOA PAYOH       628  Lorong 1 Toa Payoh      319765
34  CHIJ SAINT NICHOLAS GIRLS SCHOOL     501  ANG MO KIO STREET 13      569405
35                    CHONGFU SCHOOL          170  YISHUN AVENUE 6      768959
50        FAIRFIELD METHODIST SCHOOL               100  DOVER ROAD      139648
66                        HENRY PARK       1    HOLLAND GROVE ROAD      278790
67                    HOLY INNOCENTS          5    Lorong Low Koon      536451
81                   KONG HWA SCHOOL          350  GUILLEMARD ROAD      399772
83            KUO CHUAN PRESBYTERIAN         8    BISHAN STREET 13      579793
87          MARIS STELLA HIGH SCHOOL        25   MOUNT VERNON ROAD      368051
93            METHODIST GIRLS SCHOOL          11   Blackmore Drive      599986
95                         NAN CHIAU          50   ANCHORVALE LINK      545080
96                           NAN HUA            30   Jalan Lempeng      128806
97                           NANYANG              52   KING'S ROAD      268097
115           PEI CHUN PUBLIC SCHOOL       16   LORONG 7 TOA PAYOH      319320
132              RED SWASTIKA SCHOOL     350  BEDOK NORTH AVENUE 3      469719
137                    ROSYTH SCHOOL 21   Serangoon North Avenue 4      555855
138                           RULANG    6    JURONG WEST STREET 52      649295
145          SINGAPORE CHINESE GIRLS             190  DUNEARN ROAD      309437
152                     SAINT HILDAS           2    Tampines Ave 3      529706
154 SAINT JOSEPHS INSTITUTION JUNIOR               3    ESSEX ROAD      309331
160                   TAO NAN SCHOOL          49   MARINE CRESCENT      449761
    mainlevel_code
3          PRIMARY
9          PRIMARY
10         PRIMARY
24    MIXED LEVELS
33         PRIMARY
34    MIXED LEVELS
35         PRIMARY
50         PRIMARY
66         PRIMARY
67         PRIMARY
81         PRIMARY
83         PRIMARY
87    MIXED LEVELS
93         PRIMARY
95         PRIMARY
96         PRIMARY
97         PRIMARY
115        PRIMARY
132        PRIMARY
137        PRIMARY
138        PRIMARY
145        PRIMARY
152        PRIMARY
154        PRIMARY
160        PRIMARY
add_list <- sort(unique(topschools$address))
#|eval: False
add_list <- sort(unique(topschools$address))

coords_topschools <- get_coords(add_list)

write_rds(coords_topschools, "data/rds/topschools.rds")
topschools <- read_rds("data/rds/topschools.rds")

topschools_sf <- st_as_sf(topschools, coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs = 3414)

#drop 1 and 2 columns to keep only the geometry
topschools_sf <- topschools_sf[, -c(1, 2)]

head(topschools_sf)
Simple feature collection with 6 features and 0 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 22442.59 ymin: 31484.03 xmax: 30450.9 ymax: 38071.92
Projected CRS: SVY21 / Singapore TM
                   geometry
1 POINT (22544.29 33216.96)
2 POINT (27966.81 38071.92)
3 POINT (22673.28 31484.03)
4 POINT (22442.59 34984.58)
5  POINT (30450.9 35439.72)
6 POINT (28849.34 32406.84)

3.2.4 CBD

The CBD of Singapore encompasses areas like Marina Bay, Tanjong Pagar, and Shenton Way, where a lot’s of business locates. The Geospatial Data Science Lab at the National University of Singapore lists the CBD center coordinates as 1.2800°N, 103.8500°E. It is located in the Raffles Place area, which is considered the heart of the Singapore CBD.

Build a cbd_sf just like what we did to school and hdb

cbd <- data.frame(
  longitude = 103.8500,
  latitude = 1.2800
)

cbd_sf <- st_as_sf(cbd, coords = c("longitude", "latitude"), crs = 4326)

cbd_sf <- st_transform(cbd_sf, crs = st_crs(topschools_sf))

print(cbd_sf)
Simple feature collection with 1 feature and 0 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 29856.51 ymin: 29161.42 xmax: 29856.51 ymax: 29161.42
Projected CRS: SVY21 / Singapore TM
                   geometry
1 POINT (29856.51 29161.42)

3.2.5 Hawker center, Parks, and Supermarkets

For hawker_center,parks,supermarkets, the geometry is point Z instead of point, which means it’s 3D data point with the altitude. We need to drop it to align with other 2D data points.

Then we drop columns we are not going to use later

hawker_center

hawker_center_sf <- hawker_center %>%
  st_zm() %>%                  # 去掉 Z 维度
  select(geometry)             # 只保留 geometry 列

head(hawker_center_sf)
Simple feature collection with 6 features and 0 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 22042.51 ymin: 29650.7 xmax: 35955.52 ymax: 47850.43
Projected CRS: SVY21 / Singapore TM
                   geometry
1  POINT (29874.82 29650.7)
2 POINT (22042.51 46139.03)
3  POINT (24816.7 31094.91)
4  POINT (32867.9 41500.77)
5 POINT (35955.52 43336.13)
6 POINT (26794.39 47850.43)

Now we check the summary and see only X and Y are kept, and the crs is correct.

We repeat this process to Parks and Superarkets

parks_sf <- parks %>%
  st_zm() %>%                 
  select(geometry)             

head(parks_sf)
Simple feature collection with 6 features and 0 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 19531.82 ymin: 32755.86 xmax: 37908.12 ymax: 39763.61
Projected CRS: SVY21 / Singapore TM
                   geometry
1 POINT (22197.26 33106.05)
2 POINT (22371.87 32755.86)
3 POINT (22355.91 34333.61)
4 POINT (19531.82 39763.61)
5  POINT (21136.8 35335.69)
6 POINT (37908.12 35176.79)
supermarkets_sf <- supermarkets %>%
  st_zm() %>%                  # 去掉 Z 维度
  select(geometry)             # 只保留 geometry 列

head(supermarkets_sf)
Simple feature collection with 6 features and 0 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 32184.01 ymin: 32947.46 xmax: 41384.47 ymax: 42685.17
Projected CRS: SVY21 / Singapore TM
                   geometry
1 POINT (35561.22 42685.17)
2 POINT (32184.01 32947.46)
3 POINT (33903.48 39480.46)
4 POINT (37083.82 35017.47)
5  POINT (41320.3 37283.82)
6 POINT (41384.47 37152.14)

3.2.6 Add Proximity to Facilities

The proximity function will be applied to calculate the distance from each property to facilities

proximity <- function(df1, df2, varname) {
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units()
  df1[,varname] <- rowMins(dist_matrix)
  return(df1)
}

The we calculate the proximity to CBD, mrt, bus, hawker centre, park, good primary schools, and supermarkets.

Assure that all the data are in EPSG:3414

# Assure that all the data are in EPSG:3414
hdb_sf <- st_transform(hdb_sf, 3414)
cbd_sf <- st_transform(cbd_sf, 3414)
mrt_sf <- st_transform(mrt_sf, 3414)
bus_sf <- st_transform(bus_sf, 3414)
hawker_center_sf <- st_transform(hawker_center, 3414)
topschools_sf <- st_transform(topschools_sf, 3414)
parks_sf <- st_transform(parks, 3414)
supermarkets_sf <- st_transform(supermarkets, 3414)
hdb_sf <- proximity(hdb_sf, cbd_sf, "PROX_CBD") %>%
  proximity(., mrt_sf, "PROX_MRT") %>%
  proximity(., bus_sf, "PROX_BUS") %>%
  proximity(., hawker_center_sf, "PROX_HAWKER") %>%
  proximity(., topschools_sf, "PROX_TOPSCHOOL") %>%
  proximity(., parks_sf, "PROX_PARK") %>%
  proximity(., supermarkets_sf, "PROX_MKT")

3.2.7 Add Number of Nearby Facilities

Number of facilities nearby may also contribute to the resale price.

count_facilities_within_radius <- function(df1, facilities, varname, radius) {
  # Calculate the distance matrix between df1 and facilities
  dist_matrix <- st_distance(df1, facilities) %>%
    drop_units() %>%
    as.data.frame()
  
  # Count facilities within the specified radius for each location in df1
  df1[[paste0(varname, "_count")]] <- rowSums(dist_matrix <= radius)
  return(df1)
}

We set the radius to 400 because we assume in most of time you cannot walk straightly from your home to destination. We assume the block is a square, if the long side is 400 then sum of two shorter side will be around 560 which is still within 5-10 min walking distance for a normal people.

However top school is special and walking distance wont’ be the most important consideration, we give higher tolerance to top school (25 out of 180+)

hdb_sf <- count_facilities_within_radius(hdb_sf, mrt_sf, "mrt", 600) %>%
  count_facilities_within_radius(., bus_sf, "bus", 600) %>%
  count_facilities_within_radius(., hawker_center_sf, "hawker", 600) %>%
  count_facilities_within_radius(., topschools_sf, "topschool", 1200) %>%
  count_facilities_within_radius(., parks_sf, "park", 600) %>%
  count_facilities_within_radius(., supermarkets_sf, "market",600 )

3.3 EDA

3.3.1 HDB Resale Price Using Statistical Graphics

First we plot the distribution of resale_price of 3 room HDB by using appropriate Exploratory Data Analysis (EDA) as shown in the code chunk below

ggplot(data=hdb_sf, aes(x=`resale_price`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

This histogram shows the distribution of HDB resale prices (resale_price) in Singapore. Here is a description of the chart:

  • Price Distribution: Most resale prices are concentrated between 400,000 and 500,000 SGD, forming a prominent peak.

  • Long Tail: The distribution has a long right tail, with a small number of properties priced above 800,000 SGD, reaching up to 1,600,000 SGD, indicating a higher price range.

  • Skewness: The data is right-skewed, with the majority of properties priced on the lower end, and a rapid decrease in frequency as prices increase.

This chart indicates that the vast majority of HDB resale prices are relatively moderate, with only a few high-priced properties.

hdb_sf <- hdb_sf %>%
  mutate(`resale_price` = log(resale_price))
ggplot(data=hdb_sf, aes(x=`resale_price`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

This one is much better! Very close to normal distribution. We will take this for regression later.

3.3.2 Multiple Histogram Plots distribution of variables

floor_area_sqm <- ggplot(data=hdb_sf, aes(x=floor_area_sqm)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")

remaining_lease_yr <- ggplot(data=hdb_sf, aes(x=remaining_lease_yr)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")

PROX_CBD <- ggplot(data=hdb_sf, aes(x=PROX_CBD)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")

PROX_MRT  <- ggplot(data=hdb_sf, aes(x=PROX_MRT)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")

PROX_BUS <- ggplot(data=hdb_sf, aes(x=PROX_BUS)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")

PROX_HAWKER  <- ggplot(data=hdb_sf, aes(x=PROX_HAWKER)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")

PROX_TOPSCHOOL <- ggplot(data=hdb_sf, aes(x=PROX_TOPSCHOOL)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")

PROX_PARK  <- ggplot(data=hdb_sf, aes(x=PROX_PARK)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")

PROX_MKT <- ggplot(data=hdb_sf, aes(x=PROX_MKT)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")

mrt_count  <- ggplot(data=hdb_sf, aes(x=mrt_count)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")

bus_count <- ggplot(data=hdb_sf, aes(x=bus_count)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")

hawker_count  <- ggplot(data=hdb_sf, aes(x=hawker_count)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")

topschool_count <- ggplot(data=hdb_sf, aes(x=topschool_count)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")

park_count  <- ggplot(data=hdb_sf, aes(x=park_count)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")

market_count  <- ggplot(data=hdb_sf, aes(x=market_count)) + 
  geom_histogram(bins=20, color="black", fill="lightblue")
# Arrange all plots in a grid
ggarrange(
  floor_area_sqm, remaining_lease_yr, PROX_CBD, PROX_MRT, PROX_BUS,
  PROX_HAWKER, PROX_TOPSCHOOL, PROX_PARK, PROX_MKT, mrt_count,
  bus_count, hawker_count, topschool_count, park_count, market_count,
  ncol = 3, nrow = 5
)

3.3.3 Drawing Statistical Point Map

Reveal the geospatial distribution HDB resale prices in Singapore. 

tmap_mode("view")
tmap mode set to interactive viewing
mpsz <- st_make_valid(mpsz)
tm_shape(mpsz)+
  tm_polygons() +
tm_shape(hdb_sf) +  
  tm_dots(col = "resale_price",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))
tmap_mode("plot")
tmap mode set to plotting

This map displays the spatial distribution of HDB resale prices across different areas in Singapore. Yellow represents lower resale prices, ranging from approximately 150,000 to 355,000; Light Brown to Dark Brown represents higher resale prices, with the darkest brown indicating the highest prices, up to about 1,568,000. Darker colors concentrated in certain areas, indicating clusters of higher-priced properties.

04 Multiple Linear Regression

4.1 Collinearity Analysis

It is nessacery to check correlation and remove variables with high correlation to improve accuracy

mdata <- hdb_sf %>%
  select(resale_price, geometry, month,
         floor_area_sqm, remaining_lease_yr, 
         PROX_CBD, PROX_MRT, PROX_BUS,PROX_HAWKER, PROX_TOPSCHOOL, PROX_PARK, PROX_MKT, 
         mrt_count,bus_count, hawker_count, topschool_count, park_count, market_count)
mdata_nogeo <- mdata %>%
  st_drop_geometry()
corrplot::corrplot(cor(mdata_nogeo[, 3:17]), 
                   diag = FALSE, 
                   order = "AOE",
                   tl.pos = "td", 
                   tl.cex = 0.1, 
                   method = "number", 
                   type = "upper")

The correlation matrix above shows that all the correlation values are below 0.65. Hence, there is no sign of multicolinearity. We check the VIF again to double check

model <- lm(resale_price ~ floor_area_sqm + remaining_lease_yr + PROX_CBD + PROX_MRT + PROX_BUS +
            PROX_HAWKER + PROX_TOPSCHOOL + PROX_PARK + PROX_MKT + mrt_count +
            bus_count + hawker_count + topschool_count + park_count + market_count, data = hdb_sf)
vif_values <- vif(model)
print(vif_values)
    floor_area_sqm remaining_lease_yr           PROX_CBD           PROX_MRT 
          1.034321           1.382169           1.837811           1.848312 
          PROX_BUS        PROX_HAWKER     PROX_TOPSCHOOL          PROX_PARK 
          1.061921           1.908965           1.728239           1.739462 
          PROX_MKT          mrt_count          bus_count       hawker_count 
          1.305292           1.923894           1.209827           2.175077 
   topschool_count         park_count       market_count 
          1.623932           1.651968           1.216358 

The results for vif_values show that all variables have Variance Inflation Factor (VIF) values below 10, indicating that multicollinearity is not a serious issue among these variables. Generally, VIF values below 10 are considered acceptable, and values below 5 indicate almost no multicollinearity.

export a rds data

write_rds(mdata, "data/model/mdata.rds")

4.2 Data Splitting

mdata <- read_rds("data/model/mdata.rds")

The entire data are split into training and test data sets with 65% and 35% respectively by using initial_split() of rsample package. rsample is one of the package of tigymodels.

# Filter 2023 data for training
train_data <- hdb_sf %>%
  filter(month >= "2023-01" & month <= "2023-12")

# Filter July to September 2024 data for testing
test_data <- hdb_sf %>%
  filter(month >= "2024-07" & month <= "2024-09")
write_rds(train_data, "data/model/train_data.rds")
write_rds(test_data, "data/model/test_data.rds")
train_data <- read_rds("data/model/train_data.rds")
test_data <- read_rds("data/model/test_data.rds")

4.3 Building a non-spatial multiple linear regression

price_mlr <- lm(resale_price ~ floor_area_sqm +
                  remaining_lease_yr + PROX_CBD + PROX_HAWKER +
                  PROX_MRT + PROX_PARK + PROX_MKT + mrt_count +
                  bus_count + hawker_count + topschool_count + 
                  park_count + market_count,
                data = hdb_sf)

summary(price_mlr)

Call:
lm(formula = resale_price ~ floor_area_sqm + remaining_lease_yr + 
    PROX_CBD + PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_MKT + 
    mrt_count + bus_count + hawker_count + topschool_count + 
    park_count + market_count, data = hdb_sf)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.58745 -0.06610 -0.00299  0.06146  0.67886 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.184e+01  1.505e-02 786.511  < 2e-16 ***
floor_area_sqm      1.009e-02  1.737e-04  58.106  < 2e-16 ***
remaining_lease_yr  1.004e-02  7.861e-05 127.688  < 2e-16 ***
PROX_CBD           -1.964e-05  3.326e-07 -59.066  < 2e-16 ***
PROX_HAWKER        -2.555e-05  3.745e-06  -6.821 9.65e-12 ***
PROX_MRT           -1.002e-04  4.481e-06 -22.363  < 2e-16 ***
PROX_PARK          -4.726e-05  4.042e-06 -11.693  < 2e-16 ***
PROX_MKT            7.080e-05  6.577e-06  10.764  < 2e-16 ***
mrt_count          -4.537e-03  1.364e-03  -3.327 0.000882 ***
bus_count           1.759e-03  2.378e-04   7.397 1.53e-13 ***
hawker_count       -8.176e-03  1.547e-03  -5.286 1.28e-07 ***
topschool_count     1.134e-02  2.045e-03   5.546 3.01e-08 ***
park_count         -3.442e-03  1.099e-03  -3.133 0.001737 ** 
market_count        1.019e-02  7.765e-04  13.119  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1022 on 8249 degrees of freedom
Multiple R-squared:  0.7332,    Adjusted R-squared:  0.7328 
F-statistic:  1744 on 13 and 8249 DF,  p-value: < 2.2e-16
Note

The linear regression model predicts resale_price using several variables such as floor_area_sqm, remaining_lease_yr, and proximity measures to amenities (PROX_CBD, PROX_HAWKER, PROX_MRT, etc.), along with various facility counts (mrt_count, bus_count, etc.). The model explains approximately 73.3% of the variance in resale_price (Multiple R-squared = 0.7332, Adjusted R-squared = 0.7328), indicating a strong fit.

Most coefficients are statistically significant (p < 0.001), suggesting that the predictors have a meaningful impact on resale_price. For instance: - Positive influence: floor_area_sqm and remaining_lease_yr both positively impact resale prices, as expected. - Negative influence: Proximity to CBD (PROX_CBD), MRT stations (PROX_MRT), parks (PROX_PARK), and hawker centers (PROX_HAWKER) generally shows a negative impact, potentially due to noise or congestion.

The residuals are relatively small, with a residual standard error of 0.1022, indicating that the model predictions closely align with the actual values. This model is statistically significant with an overall F-statistic of 1744 (p < 2.2e-16), confirming the model’s predictive strength.

4.4 Basic GWR Predictive Model

Convert the sf data.frame to SpatialPointDataFrame calibrate the gwr-based hedonic pricing model by using adaptive bandwidth and Gaussian kernel as shown in the code chunk below.

train_data_sp <- as_Spatial(train_data)
train_data_sp
class       : SpatialPointsDataFrame 
features    : 6355 
extent      : 11646.69, 45192.3, 28097.64, 48622.47  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 28
names       :   month,       town, flat_type, block,   street_name, storey_range, floor_area_sqm, flat_model, lease_commence_date,    remaining_lease,     resale_price,            address, remaining_lease_yr, remaining_lease_mth, postal, ... 
min values  : 2023-01, ANG MO KIO,    3 ROOM,     1, ALJUNIED CRES,     01 TO 03,             52,       DBSS,                1966,  42 years 01 month, 11.9183905730784,         1 BEACH RD,                 42,                   1, 050004, ... 
max values  : 2023-12,     YISHUN,    3 ROOM,  999B, YUAN CHING RD,     40 TO 42,            155,    Terrace,                2020, 95 years 06 months, 13.9978321147582, 999B BUANGKOK CRES,                 95,                  11, 824306, ... 

Compute adaptive bandwidth

# Perform adaptive bandwidth selection for geographically weighted regression
bw_adaptive <- bw.gwr(resale_price ~ floor_area_sqm +
                        remaining_lease_yr +
                        PROX_CBD + PROX_MRT + PROX_BUS +
                        PROX_HAWKER + PROX_TOPSCHOOL + PROX_PARK + PROX_MKT +
                        mrt_count + bus_count + hawker_count + topschool_count + park_count + market_count,
                      data =train_data,
                      approach = "CV",
                      kernel = "gaussian",
                      adaptive = TRUE,
                      longlat = FALSE)

The result shows that 46 neighbour points will be the optimal bandwidth to be used if adaptive bandwidth is used for this data set.

write_rds(bw_adaptive, "data/model/bw_adaptive.rds")

4.5 Preparing coordinates data

Extracting coordinates data

coords <- st_coordinates(mdata)
coords_train <- st_coordinates(train_data)
coords_test <- st_coordinates(test_data)
coords_train <- write_rds(coords_train, "data/model/coords_train.rds" )
coords_test <- write_rds(coords_test, "data/model/coords_test.rds" )

Droping geometry field

train_data <- train_data %>% 
  st_drop_geometry()

4.6 Geographical Random Forest Model

set.seed(1234)
gwRF_adaptive <- grf(formula = resale_price ~ floor_area_sqm +
                        remaining_lease_yr +
                        PROX_CBD + PROX_MRT + PROX_BUS +
                        PROX_HAWKER + PROX_TOPSCHOOL + PROX_PARK + PROX_MKT +
                        mrt_count + bus_count + hawker_count + topschool_count + park_count + market_count,
                     dframe=train_data, 
                     bw=46,
                     kernel="adaptive",
                     coords=coords_train)
Note
  • Model Summary:

  • Residuals Analysis:

    • OOB Residuals: Median and mean values close to zero, with a slight range in values, indicating low residual error in predictions.

    • Predicted Residuals (Non-OOB): Shows a similar trend with a low median and mean close to zero, suggesting minimal bias.

  • Local Variable Importance: The summary mentions local variable importance, which likely provides insights into how variable importance varies spatially.

  • Overall Metrics for Local Model:

    • MSE (OOB): 0.004 and R-squared (OOB): 88.98%, which is consistent with the global model’s performance.

    • Predicted MSE (Not OOB): 0.002, with an R-squared of 95.72% on hold-out data, indicating that the model generalizes well.

    • AIC/AICc: Both metrics are lower on the hold-out data than the OOB data, indicating a better fit on the validation data.

Interpretation

This geographically weighted random forest model performs well, with a high R-squared, low MSE, and significant variable importance concentrated on key factors like remaining_lease_yr, floor_area_sqm, and PROX_CBD. The model shows strong predictive power and generalizes well to unseen data (as indicated by the performance on non-OOB data).

write_rds(gwRF_adaptive, "data/model/gwRF_adaptive.rds")
gwRF_adaptive <- read_rds("data/model/gwRF_adaptive.rds")

4.7 Predicting by using test data

test_data <- cbind(test_data, coords_test) %>%
  st_drop_geometry()
gwRF_pred <- predict.grf(gwRF_adaptive, 
                           test_data, 
                           x.var.name="X",
                           y.var.name="Y", 
                           local.w=1,
                           global.w=0)
GRF_pred <- write_rds(gwRF_pred, "data/model/GRF_pred.rds")
GRF_pred <- read_rds("data/model/GRF_pred.rds")
GRF_pred_df <- as.data.frame(GRF_pred)

append the value back to data frame

test_data_p <- cbind(test_data, GRF_pred_df)
write_rds(test_data_p, "data/model/test_data_p.rds")

4.8 Calculating Root Mean Square Error

The root mean square error (RMSE) allows us to measure how far predicted values are from observed values in a regression analysis. In the code chunk below, rmse() of Metrics package is used to compute the RMSE.

rmse(test_data_p$resale_price, 
     test_data_p$GRF_pred)
[1] 0.1438254

4.9 Visualising the predicted values

Visualise the actual resale price and the predicted resale price.

ggplot(data = test_data_p,
       aes(x = GRF_pred,
           y = resale_price)) +
  geom_point()

05 Conclusion

This study provides an enhanced approach to predicting HDB resale prices by addressing the spatial characteristics often ignored in conventional models. Housing, as a significant component of household wealth, requires accurate predictive models to support better decision-making for both buyers and policymakers. Traditional models, such as Ordinary Least Squares (OLS), fail to account for the inherent spatial autocorrelation and heterogeneity in housing data, potentially leading to biased and inconsistent predictions. By adopting a Geographically Weighted Regression (GWR) model, this study leverages the spatial nuances in housing data to improve the accuracy and reliability of price predictions.

Through an analysis of HDB resale transaction records from 2023, we calibrated a geographically weighted random forest model to forecast resale prices for July to September 2024. The model integrates both structural factors (such as floor area and remaining lease duration) and locational factors (such as proximity to MRT stations, schools, and markets) to capture the key determinants of housing prices. The results demonstrate that structural factors like remaining lease years and floor area have a strong positive influence on resale prices, while locational factors, particularly proximity to the Central Business District (CBD) and MRT stations, also significantly impact prices. These findings align with expectations that both the quality and accessibility of a property contribute to its market value.

The geographically weighted model achieved high explanatory power, with an R-squared above 88% on out-of-bag data, indicating that the spatially adaptive approach successfully captures both local and global influences on housing prices. Compared to traditional OLS methods, this model provides a more nuanced understanding of the spatial heterogeneity in housing markets, accommodating variations in price determinants across different neighborhoods.

In summary, this study demonstrates the effectiveness of Geographically Weighted Models in improving predictive accuracy for HDB resale prices by accounting for spatial factors. These findings underscore the importance of incorporating both structural and locational variables in housing price models, providing valuable insights for stakeholders in housing markets. For future research, expanding the model with additional neighborhood characteristics or exploring temporal dynamics could further refine predictive accuracy and support more comprehensive housing market analysis.